Abstract

Here, we propose a clustering technique for general clustering problems, including those that
have non-convex clusters. For a given desired number of clusters K, we use three stages to find a
clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of
various (randomly selected) sizes. The key steps are to find a K-means clustering using $K_\ell$
clusters, where $K_\ell \gg K$, and then to join these small clusters using single linkage clustering.
The second stage stabilizes the result of stage one by reclustering via the ‘membership matrix’ under
Hamming distance to generate a dendrogram. The third stage is to cut the dendrogram to get K* clusters,
where $K^* \ge K$, and then prune back to K to give a final clustering. A variant on our technique also
gives a reasonable estimate for $K_T$, the true number of clusters.

We provide a series of arguments to justify the steps in the stages of our methods and we provide 
numerous examples involving real and simulated data to compare our technique with other related 
techniques. 
1 Introduction

Clustering is an unsupervised technique used to find underlying structure in a dataset by grouping
data points into subsets that are as homogeneous as possible. Clustering has many applications in a
wide range of fields. No list of references can be complete; however, three important recent references
are [1], [2] and [3].

Arguably, there are three main classes of clustering algorithm: centroid-based, hierarchical, and
partitional. Centroid-based refers to K-means and its variants. Hierarchical comes in two main forms,
divisive and agglomerative. Partitional comes in three main forms: graph-theoretic, spectral, and
model-based. Because the scope of clustering problems is so big, all of these procedures have
limitations. So, each major class of clustering procedures has its strengths and weaknesses even if,
in some cases, these are not mapped out very precisely.

The justification for the new method presented here is that it combines two classes of methods
(centroid-based and agglomerative hierarchical) with a careful treatment of influential data points and
(i) is not limited by convexity and (ii) is not as dependent on subjective choices of quantities such as
dissimilarities. That is, we combine several clustering techniques and principles in sequence so that
one part of the technique may correct weaknesses in other parts, giving a uniform improvement - not
necessarily decisively better than other methods in particular cases, but rarely meaningfully
outperformed. In particular, we have not found any examples in which our clustering method is
outperformed to any meaningful extent. The examples presented here dramatize this since most of them
are very difficult for any clustering method. One consequence of this is that our procedure is well
designed for non-convex clusterings as well as convex ones.

A further benefit of our clustering method is that we can give formal conditions ensuring that the 
clustering will be correct for special cases. That is, we prove a theorem ensuring that the basal sets 
cover the regions in a clustering problem, in the limit of large sample size and use this to establish a 
corollary ensuring that the final clustering from our method, at least in simple cases, will be correct. 
We also give formal results ensuring that the conditions of our main theorem can be satisfied in some 
simple but general cases. To the best of our knowledge, there are no techniques, except for K-means,
for which theoretical results such as ours can be established.

To fix notation, we assume n independent and identically distributed (IID) outcomes $x_i$,
$i = 1, \ldots, n$, of a random variable X. The $x_i$'s are assumed m-dimensional and written as
$(x_{i1}, \ldots, x_{im})$ when needed. We denote a clustering of size K by
$\mathcal{C}_K = \{C_{K1}, \ldots, C_{KK}\}$; effectively we assume that for each K only one
clustering will be generated.

For a given K, we start by drawing random values $K_b$, $b = 1, \ldots, B$, from a distribution that
ensures a variety of reasonable clustering sizes will be searched. Then, the generic steps are as follows.

1. Hybrid clustering: Create $K_\ell$ clusters by K-means. Then use single linkage (SL) clustering to
take unions of these clusters to get a clustering of size $K_b$.

2. Stabilization: Repeat stage one B times; the result is B clusterings with sizes $K_1, \ldots, K_B$.
From these clusterings, form a pooled $n \times \sum_b K_b$ membership matrix M. Since each row of M
corresponds to an $x_i$ and is a vector of zeros and ones of length $\sum_b K_b$, the Hamming distance
$H(x_i, x_j)$ between any two rows can be found and is between zero and $\sum_b K_b$. These Hamming
distances give a dissimilarity so that SL clustering can again be applied.

3. Choosing a clustering: Use a ‘grow-and-prune’ approach on the dendrogram from Stage 2. Cut
the dendrogram at some dissimilarity value smaller than $H_K$, the value of the dissimilarity that
gives K clusters. The resulting K* clusters are then merged to form K clusters after ignoring any
clusters that are too small.
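
As an illustration only, the first stage can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the implementation used for the results reported here: the helper names `lloyd_kmeans` and `hybrid_clustering`, the plain Lloyd iterations, the two-ring test data, and the choice of 80 basal clusters are all hypothetical, and the set dissimilarity is taken to be the minimum pointwise Euclidean distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, squareform


def lloyd_kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd iterations; returns a label in 0..k-1 for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = cdist(X, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return cdist(X, centers).argmin(axis=1)


def hybrid_clustering(X, k_basal, k_final, rng):
    """K-means into k_basal basal clusters, then single-linkage merge to k_final."""
    basal = lloyd_kmeans(X, k_basal, rng)
    ids = np.unique(basal)  # ignore basal clusters that ended up empty
    # Set dissimilarity: minimum pointwise distance between two basal clusters.
    D = np.zeros((len(ids), len(ids)))
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            D[a, b] = D[b, a] = cdist(X[basal == ids[a]], X[basal == ids[b]]).min()
    Z = linkage(squareform(D, checks=False), method="single")
    merged = fcluster(Z, t=k_final, criterion="maxclust")
    # Give each point the merged label of its basal cluster.
    return merged[np.searchsorted(ids, basal)]


# Hypothetical test data: two concentric rings, a non-convex two-cluster problem.
rng = np.random.default_rng(0)
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]
X = np.vstack([inner, 3.0 * inner])
labels = hybrid_clustering(X, k_basal=80, k_final=2, rng=rng)
```

K-means alone with K = 2 would cut both rings with a straight boundary; here the many small basal clusters follow the rings and single linkage chains them back together.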


The last stage involves possibly two reclusterings: one to merge the larger clusters down to K
clusters and a second to merge the small clusters into these K clusters. The definition of small clusters
requires the use of a cutoff value $\alpha$, here taken to be 0.05; details are in Sec.

In Step 1), there is a range of choices for the dissimilarity to be used in the SL clustering. The usual
dissimilarity, namely, defining the minimum distance between two sets to be the shortest path length
connecting them, is one valid choice. As seen below, however, it is most effective when the data are
generated from a probability measure P that has disjoint closed components each the closure of an open
set. When the components of P are not disjoint, we have found it advantageous to use a robustified
form of the minimum distance between two sets, namely the 20th percentile of the distances between
the points in the two sets. This is discussed further in Subsec. 3.2.


There are precedents for the kind of hybrid clustering described in stages 1) and 2) that combines
two or more distinct clustering techniques. Perhaps the closest is [4]. Their central idea is to create
many clusterings of different sizes (by K-means) that can be pooled via a ‘co-association matrix’ that
weights points in each clustering according to their membership. This matrix can then be modified (take
one minus each entry) to give a dissimilarity so that single-linkage clustering can be used to give a
final clustering. [4] refer to this as evidence accumulation clustering (EAC) because they are pooling
information over a range of clusterings. EAC differs meaningfully from our technique in three ways.
First, in our technique we choose a single $K_\ell$ while EAC uses a range of cluster sizes. Second, we
ensemble (and hence stabilize) directly by membership in terms of Hamming distance whereas EAC
ensembles by a co-association. Third, our procedure uses an extra step of growing and pruning a
dendrogram (see Step 5 in Algorithm 1) that is akin to an optimization over ‘main’ clusters. Our
‘fine-tuning’ of their technique seems to give better results.

Another technique that is conceptually similar to ours is due to Chipman and Tibshirani [5], hereafter
CT. First, in a ‘bottom-up stage’, small sets of points that are not to be separated are replaced by their
centroids. Then, in a ‘top-down stage’ the remaining points are clustered divisively to give big clusters.
Then, the bottom-up and top-down stages are reconciled to give a final clustering. Our proposed
technique differs from CT in four key ways. First, we use K-means in place of CT’s ‘mutual clusters’.
Second, we use single linkage where CT uses average linkage. Third, our technique has a stabilization
stage. Fourth, our technique uses a ‘grow and prune’ strategy, unlike CT. So, it is unclear how well CT
performs when the true clusters are non-convex.

A third technique, conceptually related to ours but nevertheless very different, is due to Karypis et
al. ([6]). This technique, often called CHAMELEON, rests on a graph-theoretic analysis of the clustering
problem and uses two passes over the data. The first is a graph-partitioning-based algorithm to
divide the data set into a collection of small clusters. The second pass is an agglomerative hierarchical
clustering based on connectivity (a graph-theoretic concept) to combine these clusters. Our method
differs from [6] in four key ways. First, we use K-means instead of graph partitioning. Second, we
simply use single linkage whereas [6] combines small clusters based on both closeness and relative
interconnectivity. Third, our technique has a stabilization stage to manage cluster boundary uncertainty.
Fourth, our technique explicitly uses a ‘grow and prune’ strategy permitting a ‘look ahead’ to more
clusters than necessary. By contrast, CHAMELEON has an elaborate optimization. On the other hand,
both can find non-convex clusters.

To the best of our knowledge, the earliest explicit proposal for hybrid methods is in [7], who observed
that using K-means with K too large and single linkage may enable a technique to find non-convex
clusters.

In addition to proposing a new hybrid clustering technique (Algorithm 1) we present a way to
estimate the correct value $K_T$ of K in Algorithm 2. Essentially, we combine the first three steps of
Algorithm 1 with a modification of [4].

The rest of this paper is organized as follows. In Sec. 2 we present our two algorithms for clustering
and estimating $K_T$. In Sec. 3 we provide justifications for some of the steps in our algorithms. For the
steps where we are unable to provide theory, we provide methodological interpretations as a motivation
for their use. In Sec. 4 we present our numerical comparisons. Our concluding remarks are in Sec.
2 Presentation of techniques


We begin with Algorithm 1, which formalizes our generation of clusterings. It has five steps and five
inputs: the number K of clusters to be in the final clustering, a number $K_{\max}$ to be the largest number
of clusters that we would consider reasonable, a number B of iterations of our initial hybrid clustering
technique, a number $K_\ell \gg K$ of smaller clusters that will be concatenated to larger clusters, and a
value $\alpha$ to serve as a cutoff for the size of a cluster as measured by the proportion of how many of
the $K_\ell$ clusters had to be combined to create it. In practice, setting $K_\ell = \lfloor n/5 \rfloor$
worked reasonably well; however, $\lfloor n/5 \rfloor$ is an arbitrary choice and we found that adding a
layer of variability by choosing $K_\ell$ at random over a range of values with $\lfloor n/6 \rfloor$ as an
endpoint gave improved results. Separately, we also found that larger
values of $K_{\max}$ seemed to require larger values of B to get good results. We address the choice of B
and $K_{\max}$ later in Sec. In our work here, we merely set $\alpha = 0.05$. This ensured that we got at least
K clusters in our examples. Loosely, the more outliers or clusters there are, the smaller one should
choose $\alpha$. So, effectively, given Algorithm 1, only K must be specified. The specification of K is
done separately in Algorithm 2.

Our clustering algorithm is stated as Algorithm 1.


For brevity, we refer to Algorithm 1 as SHC.

In SHC, we have specified the use of K-means in the first part of Step 1 but left open which dissimilarity
to use in the second part of Step 1. This is intentional because we can establish theory for our method
that suggests the usual minimum distance dissimilarity is best when the components of P are separated
(convex or not); however, a dissimilarity between sets based on the 20th percentile of the distances
between their points works better when the separation is not clear or entirely absent. In our examples
below we denote these dissimilarities by writing $\mathrm{SHC}_m$ (minimal) and $\mathrm{SHC}_{20}$
(20th percentile).

Note that the number of clusters K* is defined internally to the algorithm in Step 4. The idea is to
get a tree that is slightly larger than the one obtained by cutting at $H_K$, i.e., to let the algorithm
search an extra few steps ahead for good clusters. In Step 5, any extra clusters that are found but not
helpful are pruned away. The intuition behind the choice of K* is that the level of the dissimilarity it
represents identifies the point at which chaining begins to affect the clustering procedure negatively.

Algorithm 1 can serve as the basis for another algorithm to estimate $K_T$. We add an extra step
derived from the method for choosing K in [4]. Recall that [4] considered a set of ‘lifetimes’ that were
lengths in terms of the dissimilarity. These were the distances between the values on the vertical axis
at which one could cut a dendrogram so as to get a collection of clusters with the property that at
least one of the clusters emerges precisely at the value on the vertical axis at which the horizontal line
was drawn. [4] then cut the dendrogram at the dissimilarity that corresponds to the maximum of
these vertical distances to choose the number of clusters. Algorithm 2 extends this method by using
it once, removing some clusters, and then using it again.

Our general procedure is given in Algorithm 2.

For brevity, we refer to Algorithm 2 as EK.
Algorithm 1 Stabilized Hybrid Clustering (SHC)

1. Given K, start by drawing a value of $K_\ell$ and then drawing values
$K_b \sim \mathrm{DUnif}(2, K_{\max})$, where $K_{\max} < K_\ell$, for $b = 1, \ldots, B$. For each $K_b$, do
the following with randomly generated initial conditions to obtain
$\mathcal{C}(K_b) = \{C_{b1}, \ldots, C_{bK_b}\}$:

• Use standard K-means clustering (or any partitional technique) to generate a clustering
of size $K_\ell$ ‘basal’ clusters.

• Next, use single linkage clustering (or any agglomerative technique) to merge the
basal clusters to get a clustering $\mathcal{C}(K_b)$ of size $K_b$.


2. For $\mathcal{C}(K_1)$, let $M_1 = (x(s,t))_{s=1,\ldots,n;\ t=1,\ldots,K_1}$ be the $n \times K_1$
membership matrix with entries

\[ x(s,t) = \begin{cases} 1 & \text{if } x_s \in C_{1t}, \\ 0 & \text{if } x_s \notin C_{1t}. \end{cases} \]

Doing the same for the rest of the $\mathcal{C}(K_b)$'s generates membership matrices
$M_1, \ldots, M_B$ for clusterings $\mathcal{C}(K_1), \ldots, \mathcal{C}(K_B)$, respectively.
Concatenating the $M_b$'s gives the overall membership matrix $M(B) = [M_1, \ldots, M_B]$.


3. From the $n \times \sum_b K_b$ overall membership matrix M(B) we construct a dissimilarity matrix
using Hamming distance. Let $S = \sum_b K_b$. That is, the i-th and j-th rows in M(B) are
of the form $x_i = (x_{i,1}, \ldots, x_{i,S})$ and $x_j = (x_{j,1}, \ldots, x_{j,S})$ and so give
dissimilarities

\[ d_{ij} = d(x_i, x_j) = \sum_{m=1}^{S} \delta(x_{im}, x_{jm}), \]

where $\delta(x_{im}, x_{jm}) = 1$ if $x_{im} \ne x_{jm}$ and zero otherwise. So, $d_{ij}$
is the number of entries in $x_i$ and $x_j$ that are different and $0 \le d_{ij} \le S$. Let
$D = (d_{ij})$ be the resulting matrix.


4. Given D, use SL clustering to generate a vertical dendrogram with leaves at the bottom
and dissimilarity values on the y-axis. Since K is given, it corresponds to a dissimilarity
value $H_K$ on the vertical axis, namely, $H_K$ is the maximum dissimilarity associated with
K clusters. Now, there will be K lines or branches on the dendrogram that cross $H_K$. Let
the lengths of these lines from $H_K$ down to the next split be denoted $L_1, \ldots, L_K$. Cut the
dendrogram at a dissimilarity value below $H_K$ chosen from these lengths, and let K* be the number
of clusters at that value.


5. Write $K^* = K + v$ with $v \ge 0$. If $v = 0$, the clustering from Step 4 of size K is the final
clustering. If $v \ge 1$, write $v = v_1 + v_2$ where $v_2$ is the number of clusters in the clustering
$\mathcal{C}_{K^*}$ for which $\#(C_{K^*j})/n < \alpha$. In the case that K clusters of size at least
$\alpha$ do not exist, $\alpha$ is adjusted downward until K such clusters exist. Ignore these $v_2$
clusters and, using SL (under the corresponding submatrix of D), recluster the points in the remaining
$(K + v_1)$ clusters to reduce them to K ‘main’ clusters. Then, use SL clustering again to assign the
points in the $v_2$ clusters to the K ‘main’ clusters to give the final clustering of size K.
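
Steps 2 and 3 of the algorithm can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names `membership_matrix` and `hamming_dissimilarity` and the three toy labelings are hypothetical.

```python
import numpy as np


def membership_matrix(labelings):
    """Stack one-hot encodings of several labelings column-wise (n x sum_b K_b)."""
    blocks = []
    for labels in labelings:
        labels = np.asarray(labels)
        ids = np.unique(labels)
        blocks.append((labels[:, None] == ids[None, :]).astype(int))
    return np.hstack(blocks)


def hamming_dissimilarity(M):
    """d_ij = number of columns in which rows i and j of M differ."""
    return (M[:, None, :] != M[None, :, :]).sum(axis=2)


# Three toy clusterings of four points, with sizes 2, 2 and 3 (so S = 7).
labelings = [np.array([0, 0, 1, 1]), np.array([0, 0, 0, 1]), np.array([2, 0, 1, 1])]
M = membership_matrix(labelings)
D = hamming_dissimilarity(M)
```

Because each block of columns has exactly one 1 per row, every clustering that separates two points adds 2 to their Hamming distance, so $d_{ij}$ is twice the number of the B clusterings in which points i and j fall in different clusters.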
Algorithm 2 Estimate of $K_T$ (EK)

1. Use Steps 1-3 from Algorithm 1 to obtain D.

2. Form the dendrogram for the data under D using SL.

3. Use the technique of [4] to find the two largest lifetimes.

4. For each of the two largest lifetimes, cut the dendrogram at that lifetime and examine the size
of the clusters. Remove clusters that are both small (containing less than $100\alpha\%$ of the data)
and split off at or just below $H_K$. This gives two sub-dendrograms, one for each lifetime.

5. For each of the sub-dendrograms, cut at $H_K$. This gives two numbers of clusters. Take the
mean of these two numbers of clusters as the estimate of the correct number of clusters.
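
To make the notion of a lifetime concrete, the following Python sketch computes the longest lifetimes from a single linkage dendrogram. It covers only the lifetime computation used in Step 3, and it assumes that the lifetime of a k-cluster solution is the gap between the two consecutive merge heights that bracket it; the function name `largest_lifetimes` and the one-dimensional toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def largest_lifetimes(Z, top=2):
    """Return (lifetime, number_of_clusters) pairs for the `top` longest lifetimes."""
    heights = Z[:, 2]        # merge heights, nondecreasing for single linkage
    n = len(heights) + 1     # number of leaves
    gaps = np.diff(heights)  # gaps[i] = heights[i + 1] - heights[i]
    # Cutting anywhere inside gaps[i] leaves n - (i + 1) clusters.
    order = np.argsort(gaps)[::-1][:top]
    return [(float(gaps[i]), int(n - (i + 1))) for i in order]


# Toy data: three tight groups on the line, at roughly 0, 5 and 9.
x = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [9.0], [9.1]])
Z = linkage(pdist(x), method="single")
lifetimes = largest_lifetimes(Z)
```

For these points the longest lifetime belongs to the three-cluster solution and the second longest to the two-cluster solution.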


3 Justification 

In this section we provide motivation, interpretation, and properties of the steps in the two algorithms 
we have proposed. 

3.1 K-means with large $K_\ell$

Let the probability measure P have density p and assume that p only takes values zero and a single, fixed
constant. The places where p assumes a nonzero value are the clusters of P. Our first result shows that
the support of p can be expressed as a disjoint union of small clusters in the limit of large n. Let
$A \triangle B$ denote the symmetric difference between sets A and B and for any set A, let
$\operatorname{diam}(A) = \sup_{x, y \in A} d(x, y)$ be the diameter of A. Now, given data
$x^n = \{x_1, \ldots, x_n\}$ write $\hat{\mathcal{C}}_K = \{\hat{C}_{K1}, \ldots, \hat{C}_{KK}\}$ to be a
clustering of $x^n$ into K clusters. Our result is the following.

Theorem 1. Suppose the following assumptions are satisfied:

1. $\forall K, m$ $\exists\, C_{Km}$ so that

\[ P(\hat{C}_{Km} \,\triangle\, C_{Km}) \to 0 \]

in P-probability as $n \to \infty$.

2. For any m,

\[ \sup \operatorname{diam}(\hat{C}_{Km}) \to 0 \]

as $K \to \infty$.

3. For each m and K, $P(\hat{C}_{Km}) > 0$.

4. The support of p, Supp(P), consists of finitely many disjoint open sets with disjoint closures having
smooth boundaries.

5. The random variable X generating $x^n$ is bounded.

Then for any fixed m, along any sequence of sets $\hat{C}_{Km}$ with $P(\hat{C}_{Km}) > 0$, there is a
$z \in \mathrm{Supp}(P)$, the support of P, so that

\[ E(X \mid \hat{C}_{Km}) \to z, \tag{1} \]

as n increases first and K increases second, at suitable rates.

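Proof. Consider a sequence $\{\hat{C}_{Km}\}_{K=1}^{\infty}$ for which $P(\hat{C}_{Km}) > 0$; this is
possible by items 1) and 3). By item 2), $P(\hat{C}_{Km}) \to 0^+$.

Step 1: For such a sequence,

\[ E(X \mid \hat{C}_{Km}) - E(X \mid C_{Km}) \xrightarrow{P} 0 : \]

Begin by writing

\begin{align}
E(X \mid \hat{C}_{Km}) - E(X \mid C_{Km})
&= \frac{\int_{\hat{C}_{Km}} X \, dP}{P(\hat{C}_{Km})} - \frac{\int_{C_{Km}} X \, dP}{P(C_{Km})} \tag{2} \\
&= \underbrace{\frac{\int_{\hat{C}_{Km}} X \, dP}{P(C_{Km})} - \frac{\int_{C_{Km}} X \, dP}{P(C_{Km})}}_{\text{①}}
 + \underbrace{\frac{\int_{\hat{C}_{Km}} X \, dP}{P(\hat{C}_{Km})} - \frac{\int_{\hat{C}_{Km}} X \, dP}{P(C_{Km})}}_{\text{②}}. \tag{3}
\end{align}

Since X is bounded, term ① goes to zero as $n \to \infty$ by the Dominated Convergence Theorem since
$\mathbf{1}_{\hat{C}_{Km}} - \mathbf{1}_{C_{Km}} \to 0$ in P-probability under item 1).

To deal with term ②, write it as

\[ \left( \frac{1}{P(\hat{C}_{Km})} - \frac{1}{P(C_{Km})} \right) \int_{\hat{C}_{Km}} X \, dP. \tag{4} \]

Since X is bounded by M, say, the absolute value of term ② is bounded by

\[ M \, P(\hat{C}_{Km}) \left| \frac{1}{P(\hat{C}_{Km})} - \frac{1}{P(C_{Km})} \right|
 = M \left| 1 - \frac{P(\hat{C}_{Km})}{P(C_{Km})} \right|. \tag{5} \]

Now, by assumption 1, with probability at least $1 - \eta$, for any $\eta > 0$, as $n \to \infty$ we have

\[ \frac{P(\hat{C}_{Km})}{P(C_{Km})} \to 1. \]

So, the factor in absolute value bars in (5) can be made less than any pre-assigned positive number,
for instance, $\eta / M$, giving that ② can be made arbitrarily small as $n \to \infty$. Consequently,

\begin{align*}
E(X \mid \hat{C}_{Km}) &= \left[ E(X \mid \hat{C}_{Km}) - E(X \mid C_{Km}) \right] + E(X \mid C_{Km}) \\
&= o_P(1) + E(X \mid C_{Km})
\end{align*}

and Step 1 is complete.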

Step 2: By item 2), $\exists z$ such that $\hat{C}_{Km} \to \{z\}$. So, by Step 1, as $n \to \infty$,

\[ E(X \mid \hat{C}_{Km}) \to z \]

in P-probability. Now, to prove the theorem, it remains to show $z \in \mathrm{Supp}(P)$.

By way of contradiction, suppose $z \notin \mathrm{Supp}(P)$. Then, since Supp(P) is a closed set by
item 4), its complement is open and hence $\exists \epsilon > 0$ so that
$B(z, \epsilon) \subset \mathrm{Supp}(P)^{C}$, where $B(z, \epsilon)$ indicates a ball
centered at z of radius $\epsilon$. However, consider a sequence of sets $\hat{C}_{Km}$ for some fixed m with

\[ \forall K : \ P(\hat{C}_{Km}) > 0; \]

such a sequence must exist for some m by Item 3). By Item 2), we have that

\[ \operatorname{diam}(\hat{C}_{Km}) \to 0 \quad \text{as } K \to \infty. \tag{6} \]

So, $\exists K_0$ such that $\forall K > K_0$,

\[ \hat{C}_{Km} \subset B(z, \epsilon), \]

and therefore $P(\hat{C}_{Km}) = 0$ by letting n and K increase at appropriate rates, a contradiction.
Hence, $z \in \mathrm{Supp}(P)$, establishing the theorem. □

The utility of the theorem stems mostly from the following corollary. 

Corollary 1. There exists a $K_0$ so that for $K > K_0$, there are $m_1, \ldots, m_\ell \le K$ for some
$\ell$ with

\[ P\left( \mathrm{Supp}(P) \,\triangle\, \left( \cup_{i=1}^{\ell} \hat{C}_{K m_i} \right) \right) < \epsilon. \tag{7} \]

That is, there are rates at which $n \to \infty$, $K \to \infty$ and $\epsilon \to 0^+$, so that in a
limiting sense

\[ \mathrm{Supp}(P) \approx \cup_{i=1}^{\ell} \hat{C}_{K m_i}. \tag{8} \]

This corollary gives conditions under which the procedure of choosing K too large, in K-means for
instance, ensures that the union of the clusters for that K very closely approximates the support of X,
regardless of whether the support is convex or not.

Since assumptions 3), 4) and 5) are straightforward to assess, we provide sufficient conditions for
assumptions 1) and 2) for the special case of K-means clustering.

To do this for assumption 1), recall that K-means uses the Euclidean distance to define the
dissimilarity d(x, x') for points x and x'. Formally, in the limit of large sample sizes, let $\mu_k$,
$k = 1, \ldots, K$, be the means of the unknown classes $C_{Kk}$ under clustering
$\mathcal{C}_K = \{C_{K1}, \ldots, C_{KK}\}$ and let C be the membership function that assigns data
points $x_i$ to clusters, i.e., $C(i) = m \iff x_i \in C_{Km}$ under the clustering $\mathcal{C}_K$.
Then the K-means clustering is the $\mathcal{C}_K$ that achieves

\[ \min_{C} \ \min_{\mu_1, \ldots, \mu_K} \ \sum_{k=1}^{K} \sum_{i : C(i) = k} \| x_i - \mu_k \|^2. \tag{9} \]

(Strictly speaking, the objective function in (9) should be written in its limiting form

\[ \sum_{m=1}^{K} \int_{C_{Km}} \| x - \mu_{Km} \|^2 \, dP(x) \tag{10} \]

with the constraints $\mu_{Km} = \int_{C_{Km}} x \, dP$.)

Under the K-means optimality criterion, given K there are
$\mu_{K1}, \ldots, \mu_{KK} \in \mathrm{Supp}(P)$ such that the minimum in (9) (or (10)) can be written as

\[ \sum_{m=1}^{K} \int_{C_{Km}} (x - \mu_{Km})^2 \, P(dx), \]

with the property that

\[ x \in C_{Km} \iff d(x, \mu_{Km}) \le d(x, \mu_{K\ell}), \quad \ell \ne m. \tag{11} \]

Defining the centroid of $C_{Km}$ as

\[ \mu_{Km} = E(X \mid X \in C_{Km}) \]

with corresponding estimate defined as

\[ \hat{\mu}_{Km} = E(X \mid X \in \hat{C}_{Km}), \]

we can quote the following result.

Theorem 2. Under various regularity conditions, as $n \to \infty$, the K-means clustering
$\hat{\mathcal{C}}_K$ is consistent for $\mathcal{C}_K$. In particular,

\[ \hat{\mu}_{Km} = E(X \mid X \in \hat{C}_{Km}) = \int_{\hat{C}_{Km}} x \, P(dx), \qquad
   \mu_{Km} = E(X \mid X \in C_{Km}) = \int_{C_{Km}} x \, P(dx), \qquad
   \hat{\mu}_{Km} \to \mu_{Km} \ \text{a.s.} \tag{12} \]

Proof. See Pollard (1981). □

Since K-means is a centroid-based clustering, we have that

\[ x \in \hat{C}_{Km} \iff d(x, \hat{\mu}_{Km}) \le d(x, \hat{\mu}_{K\ell}), \quad \ell \ne m, \]

so combining this with (12) we get that

\[ \hat{\mu}_{Km} \to \mu_{Km} \implies P(\hat{C}_{Km} \,\triangle\, C_{Km}) \to 0, \tag{13} \]

i.e., assumption 1) is satisfied.

Figure 1: Plots of two clusters using different choices of K to see how clustering divides the true
clusters.

Turning to assumption 2), consider the following example with K-means to understand the intuition
behind it. Suppose a data set is generated as two clusters of the same number of outcomes, one with
high variance and one with low variance; see the upper left panel in Fig. 1. Then, applying K-means
with increasing K, e.g., K = 2, 4, 10, 14, 18, 20, 24, 30 in Fig. 1, shows that K-means partitions the two
clusters more and more finely but continually assigns more clusters to the high variance data. In this
context, assumption 2) means that as n increases, the clusters will appear to ‘fill in’ yielding K regions
with non-void interior for each $K \ge 2$ even if the n required for a given K increases with K.

To begin formalizing this intuition, write

\[ \mathrm{WSS} = \sum_{k=1}^{K} \sum_{i : C(i) = k} \| x_i - \mu_k \|^2, \tag{14} \]

recognizable as the ‘within clusters’ sum of squares. In the one-dimensional case, suppose 2n data
points are drawn, $X_1^n = (X_1, \ldots, X_n) \sim \mathrm{Unif}[0, a]$ and
$X_2^n = (X_{n+1}, \ldots, X_{2n}) \sim \mathrm{Unif}[2a, 4a]$. Clearly, $X_1^n$ represents a low diameter
component and $X_2^n$ represents a high diameter component. If we seek a K-means clustering for K = 2,
it is clear that $X_1^n$ and $X_2^n$ should be found. However, consider K = 3. There are two natural
clusterings. The first is to split $X_1^n$ into two clusters of equal size, say $C_1$ and $C_2$, letting
$C_3 = \{X_2^n\}$. The other is the reverse: Let $C_1 = \{X_1^n\}$ and split $X_2^n$ into two clusters of
equal size, say $C_2$ and $C_3$. Take the population value of WSS to be the sum of the within-cluster
variances and recall that a uniform distribution on an interval of length w has variance $w^2/12$. Then
the value for the first clustering is $a^2/48 + a^2/48 + (2a)^2/12 = (9/24)a^2$ while for the second
clustering it is $a^2/12 + a^2/12 + a^2/12 = (6/24)a^2$. This means the second clustering, splitting
the high diameter component, gives a smaller WSS. Since K-means minimizes WSS over clusterings with a
given number of clusters, in this case, K-means would choose the second clustering.
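
The ordering of the two K = 3 clusterings can also be checked by simulation. The sketch below is illustrative only: the sample size, the seed, the value a = 1, and the helper name `wss` are assumptions, and the sample WSS in (14) weights each cluster by its number of points, so only the ordering of the two values (not the constants) should be read from it.

```python
import numpy as np


def wss(groups):
    """Within-cluster sum of squares for a list of one-dimensional arrays."""
    return sum(float(((g - g.mean()) ** 2).sum()) for g in groups)


rng = np.random.default_rng(1)
a, n = 1.0, 20000
low = rng.uniform(0.0, a, n)          # low-diameter component on [0, a]
high = rng.uniform(2 * a, 4 * a, n)   # high-diameter component on [2a, 4a]

# First clustering: split the low-diameter component at its midpoint.
split_low = wss([low[low < a / 2], low[low >= a / 2], high])
# Second clustering: split the high-diameter component at its midpoint.
split_high = wss([low, high[high < 3 * a], high[high >= 3 * a]])
```

Splitting the high-diameter component gives the smaller within-cluster sum of squares, which is the behavior the example attributes to K-means.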

The example can be continued for higher K, higher $K_T$, and other distributions, continuing to show
that K-means tends to split the largest cluster until it is worthwhile to split the smaller cluster and then
resumes splitting the larger cluster, and so on. A consequence of this is that $C_{Km}$ tends to decrease in
size as K increases and this suggests that $\hat{C}_{Km}$ will similarly decrease as assumed in Item 2) as
$n \to \infty$. We state a version of this in the following.

Proposition 1. Suppose a clustering method continually splits the largest cluster on the population
level as K increases. Then, given $\delta > 0$, there is a $K_0$ so that

\[ K > K_0 \implies \operatorname{diam}(C_{Km}) < \delta \quad \text{for all } m. \tag{15} \]

Proof. Let $\{\mu_{K1}, \ldots, \mu_{KK}\}$ be the centers of the optimal clusters and write

\[ C_{Km} = \{ x : |x - \mu_{Km}| < |x - \mu_{K\ell}| \} \]

for $\ell \ne m$. Let

\[ d_0 = \max\{ \operatorname{diam}(C_{Km}) : m = 1, \ldots, K \}. \]

Then, for K' large enough,

\[ \max\{ \operatorname{diam}(C_{K'm}) : m = 1, \ldots, K' \} < \frac{d_0}{2}. \]

Since this process can be repeated the proposition is established. □

Taken together, the results of this subsection justify the K-means part of Step 1) of Algorithm 1.


3.2 Merging the ‘basal’ clusters

Next we turn to justifying the use of single linkage (SL) in the second part of Step 1 in Algorithm 1.
Recall, SL means that we merge sets that are closest, i.e., given a distance d on, say,
$C_{K1}, \ldots, C_{KK}$, SL clustering merges the two sets that achieve

\[ \min_{m \ne m'} d(C_{Km}, C_{Km'}). \]

The question that remains is how to choose d. Here we use two choices. The first is to write d as

\[ d_{\mathrm{usual}}(C_{Km}, C_{Km'}) = \min_{x \in C_{Km},\, x' \in C_{Km'}} d(x, x'), \tag{16} \]

where d is a metric, e.g., Euclidean; i.e., $d_{\mathrm{usual}}$ gives the distance between two sets as
the minimum over the distances between their points.

However, being an order statistic, (16) can be affected by extreme values in the data set. So, we
stabilize $d_{\mathrm{usual}}(C_{Km}, C_{Km'})$ by replacing it with the 20th percentile of the distances
between points in $C_{Km}$ and $C_{Km'}$. That is, for $m \ne m'$, we find the distances

\[ \{ d(x, x') : x \in C_{Km},\ x' \in C_{Km'} \}, \]

take their order statistics, and find the approximately
$\lfloor 0.2 \, \#(C_{Km}) \, \#(C_{Km'}) \rfloor$-th order statistic. (Finding
a non-integer order statistic is done internally to the R program using linear interpolation.) We call
the resulting dissimilarity $d_{20}$, i.e.,

\[ d_{20}(C_{Km}, C_{Km'}) = \text{20th percentile of } \{ d(x, x') : x \in C_{Km},\ x' \in C_{Km'} \}, \tag{17} \]

to indicate it is based on the 20th percentile of the distances between points in the two sets. Thus, with
$d_{20}$ we are using single linkage with respect to a dissimilarity that should be robust against extreme
values. Other percentiles such as the fifth or tenth can also be used, but they gave values between
$d_{\mathrm{usual}}$ and $d_{20}$ in the examples we studied. It seemed from our work that $d_{20}$ gave
the best results in cases where $d_{\mathrm{usual}}$ did not.
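
The two set dissimilarities can be compared on a small example. The Python sketch below is ours and only illustrative: the function names `d_usual` and `d_percentile`, the two toy point sets, and the use of NumPy's default linear interpolation in place of the R routine mentioned above are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist


def d_usual(A, B):
    """Minimum Euclidean distance between a point of A and a point of B."""
    return float(cdist(A, B).min())


def d_percentile(A, B, q=20.0):
    """q-th percentile of all pairwise distances between A and B."""
    return float(np.percentile(cdist(A, B), q))


# Two columns of points 10 units apart, plus one extreme point of B lying next to A.
A = np.array([[0.0, float(i)] for i in range(5)])
B = np.array([[10.0, float(i)] for i in range(5)] + [[1.0, 2.0]])

usual = d_usual(A, B)
robust = d_percentile(A, B)
```

The single extreme point drives the minimum distance down to 1, whereas the 20th percentile stays at 10, the separation between the two columns.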

For the sake of completeness we next give conditions under which $d_{\mathrm{usual}}$ can be expected to
perform well. We are unable to demonstrate this for $d_{20}$ but suggest there will be an analogous result
since using $d_{20}$ gave results that were essentially never worse (and sometimes better) than
$d_{\mathrm{usual}}$.

Suppose Supp(P) consists of $K_T$ disjoint regions each being the closure of an open set, assumed
disjoint from the other open sets. Let $\delta$ be the minimum distance between points in disjoint
components, i.e.,

\[ \delta = \min_{S \ne S'} \ \min_{x \in S,\, x' \in S'} d(x, x'), \]

where S and S' range over the components of Supp(P). If n and K are chosen so large that all the
$C_{Km}$ for $m = 1, \ldots, K$ have $\operatorname{diam}(C_{Km}) < \delta/3$ (any
number strictly less than $\delta$ will suffice), choose a regular grid G of points in Supp(P) so that the
distance between two adjacent points on the same axis is less than $(1/2)(\delta/3)$. This ensures that each
$C_{Km}$ has at least one grid point in it. The points in G are essentially a perfectly representative set
of Supp(P) and hence of P. Now, if we apply SL with $d_{\mathrm{usual}}$ to the points in G we will always
put points or subsets in the same component together before we merge points or subsets of any two distinct
components. That is, the metric on G ensures that the closest point to any other point will always be
in the same component if possible. So, we have proved the following theorem.

Theorem 3. If the components of Supp(P) are disjoint there is a cut point in the dendrogram of the
SL merging under $d_{\mathrm{usual}}$ of the points in G that separates the components perfectly.

Now, if n is large enough, the data set can be taken as perfectly representative of Supp(P) and hence
of P, i.e., it is a good approximation to G in the sense of filling out all the components of P. Hence,
it follows that SL using $d_{\mathrm{usual}}$ can perfectly separate the components of Supp(P), and hence
of P, in a limiting sense. This does not require convexity of the components of P, only that the data
points can be regarded as essentially a perfect representation of P.
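
The existence of such a cut point is easy to see numerically on a non-convex example. In the Python sketch below, which is illustrative only, the point set `G` is a hypothetical stand-in for a perfectly representative sample: evenly spaced points on two concentric circles of radii 1 and 3, so the separation between components is 2.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]
G = np.vstack([ring, 3.0 * ring])   # 200 points per component

# Single linkage on the points, with Euclidean minimum-distance merging.
Z = linkage(G, method="single")
last_within = Z[-2, 2]              # height of the last merge inside a component
across = Z[-1, 2]                   # height of the merge joining the two components
labels = fcluster(Z, t=2, criterion="maxclust")
```

Every merge inside a ring happens at a height below 0.1, while the two rings are joined only at height 2, so any cut between these values separates the components exactly even though neither is convex.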
Note that one of the key hypotheses of this theorem forces the components of P to be separated.
In fact, this is often not the case - components may touch each other at individual points or may be
linked by a very thin short line. In these cases, the components of Supp(P) may not have disjoint
closures or the closure of the components may not give Supp(P), respectively. When assumption 4 of
Theorem 1 is satisfied, we have found that $d_{\mathrm{usual}}$ works well: In a limiting sense, two basal
sets from the same component will always be joined before either is joined to another component. However,
when hypothesis 4 of Theorem 1 is not satisfied, $d_{\mathrm{usual}}$ does not have this property. In these
cases, we have found $d_{20}$ to work better; this is seen in Subsec. 4.2.1. It should be noted that the
examples in Subsecs. 4.2.3 and 4.2.4 also do not seem to satisfy hypothesis 4 but for these cases
$d_{\mathrm{usual}}$ and $d_{20}$ give comparable results. We regard this as a reflection of the fact that
hypothesis 4 is sufficient but not necessary for the conclusion of Theorem 1. Also, although not shown
here, we examined our clustering technique using other percentile-based dissimilarities in the sense of
(17) but found they were outperformed by at least one of $d_{\mathrm{usual}}$ or $d_{20}$. One point in
favor of $d_{\mathrm{usual}}$ is that it is interpretable in that two points are in the same
cluster merely if they are close enough, unlike $d_{20}$.

Why does the 20th percentile work well in cases where hypothesis 4 is not satisfied? While we do
not have a formal argument, the intuition may be expressed as follows. If hypothesis 4 is not satisfied
and the clusters are highly non-convex, $d_{\mathrm{usual}}$ will be much more sensitive to the boundary
values of the clusters than $d_{20}$. Consequently, there may be overly influential data points - data
points that are valid but far from other data points - that will affect the sequence of merges of the
basal clusters in ways that are not representative of the support of P. Using $d_{20}$ in place of
$d_{\mathrm{usual}}$ reduces the influence of these data points, eliminating distortions of the path by
which basal clusters are merged. We do not have a rule for when to use $d_{\mathrm{usual}}$ versus
$d_{20}$; however, the presence of extreme points (as opposed to outliers) is a good indicator that
$d_{20}$ should be preferred and this is consistent with all our examples. Indeed, the examples in
Subsecs. 4.2.3 and 4.2.4, where $d_{\mathrm{usual}}$ and $d_{20}$ give equivalent results, do not
have clusters with extreme points.


3.3 Using the overall membership matrix 

In Steps 2 and 3 of Algorithm 1, a composite membership matrix M(B) for B clusterings is defined.
Then, single linkage clustering is applied to the rows of M(B) in Step 4. Because D is based on
M(B), our results should be robust: by using several random starts for the clustering
and looking only at which cluster a data point is in, we ensure that the final clustering is a sort
of ‘consensus clustering’ representing what is invariant under two sorts of randomness - randomness of
the clustering and random noise in the data points themselves.

Our use of the matrix M(B) means our method may be regarded as an ensemble approach. Each
set of columns in M(B) represents a clustering, and pooling over clusterings in Step 4 effectively means
that we are analyzing B different clustering structures for the data. The role of ensembling the
matrices is played by single linkage, which groups similar clusterings together. The result is that the
final clustering is stabilized.
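
As an illustration of pooling over clusterings, the sketch below builds a membership matrix from B = 3 made-up clusterings and computes the Hamming distances between its rows. It assumes that each clustering contributes one indicator column per cluster; the names `membership_matrix` and `hamming` and the toy labels are hypothetical, and the single linkage step that would be applied to these distances is omitted.

```python
from itertools import combinations

def membership_matrix(clusterings):
    """Stack 0/1 cluster indicators: one block of columns per clustering, one row per point."""
    n = len(clusterings[0])
    rows = [[] for _ in range(n)]
    for labels in clusterings:
        cluster_ids = sorted(set(labels))
        for i, label in enumerate(labels):
            rows[i].extend(1 if label == c else 0 for c in cluster_ids)
    return rows

def hamming(row_a, row_b):
    """Number of columns in which two membership rows disagree."""
    return sum(a != b for a, b in zip(row_a, row_b))

# B = 3 hypothetical clusterings of n = 5 points; label names differ between runs.
clusterings = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 2, 2],
]
M = membership_matrix(clusterings)
D = {(i, j): hamming(M[i], M[j]) for i, j in combinations(range(len(M)), 2)}
print(D[(0, 1)], D[(0, 2)], D[(0, 4)])
```

Under this encoding the Hamming distance between two rows is twice the number of clusterings that separate the two points, so the arbitrary naming of clusters within each run does not matter.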


3.4 Growing and pruning 

In Step 4, the algorithm grows a dendrogram of size K* by single linkage. In fact, K* may be bigger
than the size K of the desired clustering. In such cases, the dendrogram ‘grown’ is too large and must
be pruned back. This is done in Step 5.

The benefit of this is that by growing a dendrogram a little larger than required, the method may 
look one or more splits further along so that outliers or other aberrant points may be removed. The 
outliers or other aberrant points are in the V 2 small clusters that are removed before the data are 
reclustered. Leaving out the V 2 small clusters means that the resulting clustering should be more 
stable, and therefore hopefully more accurate. Of course, the outliers and aberrant points in the V 2 
small clusters must be merged back into the clustering as is done at the end of Step 5. However, they 
are merged back into clusters they were not used to form. Hence, the final clustering may be more 
representative of P than if the extra points were used to form the clusters in the first place. 
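
The merge-back idea can be sketched as follows. This is only an illustration under stated assumptions: the K largest of the K* clusters are retained, the reclustering of the retained points is skipped, and each set-aside point joins the cluster of its nearest retained point. The function name `prune_and_merge` and the nearest-point rule are hypothetical choices, not a statement of Step 5.

```python
import math
from collections import Counter

def prune_and_merge(points, labels, k):
    """Keep the k largest clusters; give every other point the label of its nearest kept point."""
    sizes = Counter(labels)
    kept = {cluster for cluster, _ in sizes.most_common(k)}
    kept_idx = [i for i, cluster in enumerate(labels) if cluster in kept]
    pruned = list(labels)
    for i, cluster in enumerate(labels):
        if cluster not in kept:
            # The set-aside point played no part in forming the kept clusters.
            nearest = min(kept_idx, key=lambda j: math.dist(points[i], points[j]))
            pruned[i] = labels[nearest]
    return pruned

# Hypothetical cut with K* = 3 clusters, one of which is a single aberrant point.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (3, 3)]
labels = [0, 0, 0, 1, 1, 1, 2]
print(prune_and_merge(points, labels, k=2))
```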

At root, Steps 2-5 are designed to take advantage of the chaining property of single linkage. Usually, 
the chaining property is a reason not to use single linkage; here the chaining property is used only to 
fill out clusters but as far as possible not to merge them. In terms of filling out clusters, the chaining 
property is desirable. It only becomes disadvantageous when it inappropriately joins clusters. 


3.5 Estimating Kt by using lifetimes 


Steps 1 and 2 in Algorithm 2 have been addressed in the preceding subsections. So, it remains to
justify the use of lifetimes in Steps 3-5 for estimating Kt.

As can be seen in Fig. 3 of [3], where an example of lifetimes for a dendrogram is given, defining
clusters by the use of a maximum lifetime has the tendency to amplify the separation between clusters so that
points are usually only put in their final cluster near the leaves of the dendrogram. That is, there are
often several long lifetimes that give reasonable places to cut the dendrogram such that the clusters
at the bottom are well separated and homogeneous in the sense that further decreases in dissimilarity
are small. This method seems to work well when the clusters are well separated, regardless of whether
they are convex.

Our refinement of the [1] technique is an effort to extend it to cases where the separation among 
clusters is not as clear. Indeed, removing subsets of data that are too small before applying [3] ensures
that likely outliers or other aberrant points will not affect the collection of lifetimes. The benefit is that 
outliers and aberrant points will rarely be seen as separate clusters, yielding a more accurate number 
of clusters. 
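
The lifetime criterion itself can be sketched in a few lines. The sketch assumes that the lifetime of a K-cluster solution is the gap between the two consecutive merge heights of the dendrogram between which a cut yields K clusters; the function name and the merge heights are made up for illustration, and the removal of small clusters described above is not included.

```python
def estimate_k_by_lifetime(merge_heights):
    """Return (K, lifetime) for the dendrogram cut with the longest lifetime."""
    heights = sorted(merge_heights)
    n = len(heights) + 1                # n leaves give n - 1 merges
    best_k, best_lifetime = 1, 0.0
    for i in range(len(heights) - 1):
        lifetime = heights[i + 1] - heights[i]
        k = n - (i + 1)                 # clusters left after the first i + 1 merges
        if lifetime > best_lifetime:
            best_k, best_lifetime = k, lifetime
    return best_k, best_lifetime

# Hypothetical merge heights for n = 7 points: four cheap merges, then a large jump.
heights = [0.25, 0.5, 0.75, 1.0, 3.0, 3.5]
print(estimate_k_by_lifetime(heights))
```

Here the longest lifetime lies between the fourth and fifth merges, so the sketch reports three clusters.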


4 Evaluation of our techniques

In this section, we compare the performance of the two proposed techniques with the existing techniques
described earlier, using eight data sets that are qualitatively different from each other. The first
five are the 3-NORMALS, AGGREGATION, SPIRAL, HALF-RING, and FLAME data sets found at [9]. These
two dimensional data sets are ‘shape data’. 3-NORMALS is simulated from three normal distributions
giving convex shapes that are not well separated. AGGREGATION has one cluster that is non-convex
and several others that are convex but not separated. SPIRAL has three separated but nonconvex
and ‘intertwined’ clusters. HALF-RING has two separated clusters that are non-convex with different
densities, making it ambiguous whether one of the clusters should be split or not. FLAME has two
clusters, one convex, the other non-convex. The two clusters are not well separated and the convex
cluster has some outliers. We did not examine other shape sets at [9] because they were similar to data
sets we had used or were too challenging for all methods.

For four of these five data sets, we considered seven different clustering techniques: K-means (K-m),
EAC, CT, CHAMELEON (hereafter abbreviated to CHA), spectral clustering (SPECC), SHCm, and
SHC20. We applied all seven to AGGREGATION, SPIRAL, HALF-RING, and FLAME but only applied
CHAMELEON to one output from 3-NORMALS because CHAMELEON is intended for nonconvex data
and is cumbersome when doing many repetitions with simulated data.

The first five data sets are two dimensional, so it is enough to compare the output of the clustering
techniques visually. However, because we can unambiguously assign a ‘true’ clustering to these data
sets, we can also calculate an accuracy index, AI, i.e., the proportion of data points correctly assigned to
their cluster. This was calculated using the software described in [12]. Since there is randomness built
into K-m, EAC, SPECC, and our method, where necessary, i.e., for non-synthetic data, we repeated the
techniques and report the mean AI (MAI) and its standard deviation (SAI). This was not necessary
for CT since it does not vary over repetitions (hence SAI for CT is always zero). The results are in
Subsecs. 4.1 and 4.2.
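
For concreteness, the following minimal sketch computes an accuracy index and the resulting MAI and SAI for a few made-up runs. It is not the software of [12]: it assumes that cluster labels are matched to the true classes by brute force over all one-to-one relabelings (feasible only for small K) and that SAI is the sample standard deviation. The function name and the labels are hypothetical.

```python
from itertools import permutations
from statistics import mean, stdev

def accuracy_index(true_labels, predicted_labels):
    """Proportion of points correctly assigned under the best one-to-one relabeling."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    if len(pred_ids) > len(true_ids):
        raise ValueError("this sketch assumes no more predicted clusters than true clusters")
    best_hits = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, predicted_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)

# Hypothetical truth and three repetitions of a randomized clustering method.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
runs = [
    [1, 1, 1, 0, 0, 0, 2, 2, 2],  # perfect up to relabeling
    [1, 1, 0, 0, 0, 0, 2, 2, 2],  # one point misassigned
    [0, 0, 0, 2, 2, 1, 1, 1, 1],  # one point misassigned
]
scores = [accuracy_index(truth, run) for run in runs]
print(f"MAI = {mean(scores):.2f}, SAI = {stdev(scores):.3f}")
```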

The last three data sets have four or more dimensions. The first of these is Fisher’s familiar IRIS 
data that has four dimensions. The second is the GARBER microarray data set found at m- It has 
916 dimensions. The third is the WINE QUALITY data set found at [10]. This data set has 11 variables 
and is really two data sets, one for red wine and one for white wine. We used all seven techniques on 
IRIS and GARBER but dropped CHAMELEON for WINE QUALITY because it was hard to implement 
and, being graph-theoretic, it cannot be expected to perform well on data that have many dimensions. 
We also dropped CT for WINE QUALITY since CT so rarely performed well. In these examples, the 
data sets are from classification problems so we have assumed that a unique true clustering exists as 
defined by the classes. We can again calculate the MAI and SAI as described above. The results are
in Subsec. 4.3.

In all eight examples, we set B = 200 for EAC, SHCm, and SHC20 to ensure fairness. In the first seven
examples we set $K_{\max} = 25$ in SHC; in the GARBER data set we chose $K_{\max}$ = H because $K_{\max}$ < Ki -
2 and $\lfloor n/5 \rfloor = \lfloor 74/5 \rfloor = 14$ so that it made sense to draw the $K_i$'s from $DU(\lfloor n/6 \rfloor, \lfloor n/4 \rfloor) = DU(12, 18)$.
The implication of our examples is that while other methods may equal or even perform slightly better 
than one or both of the SHC methods in some cases, no competitor beats them consistently by a 
nontrivial amount. 

We begin with straightforward comparisons of the clustering techniques for fixed K and then turn to
comparing the techniques for choosing K. Specifically, for all eight data sets, we compare six techniques
for choosing K, namely, the silhouette distance, the gap statistic, BIC, EAC, and two methods based
on Algorithm 2 (EKm and EK20). These results are given in Subsec. 4.4.


4.1 Convex example: 3-NORMALS


In order to study this convex clustering problem, consider Figure 2 showing n = 120 observations that
clearly form three clusters. The data were generated by taking 40 independent and identical draws
from each of three normal distributions. Specifically, the three normals are
j = 1, 2, 3 where 



and 



$$\begin{pmatrix} 1.5 & 0 \\ 0 & 0.4 \end{pmatrix}.$$


Applying the seven clustering techniques to one set of the 3-NORMALS data with Kt = 3 gives Fig. 2.
First, the upper left panel shows the true clusters. It appears that K-m, CT, SPECC, and SHC20 do
roughly equally well, although spectral clustering tends to enlarge the right hand cluster unduly. By
contrast, CHA, EAC, and SHCm do not give intuitively reasonable results. CHA makes the lower left
cluster too small; EAC and SHCm effectively merge the right and bottom clusters.



Figure 2: The true clustering of one run of 3-NORMALS data and seven estimated clusterings using 
K-m, CT, CHA, SPECC, EAC, SHCm, and SHC20. 

These observations are mostly but not entirely consistent with Table 1, in which MAI values
are rounded to two decimal places and SAI values are rounded to three decimal places. Table 1 is
the summary of the performance of the six methods over 200 simulated data sets; CHA is omitted as
noted earlier. Clearly, despite Fig. 2, CT and EAC are poor in an average sense (MAI) while SHCm
is comparable to SHC20 - suggesting the single example of SHCm in Fig. 2 is unrepresentative of its
general performance. The performance of spectral clustering is better than Fig. 2 suggests but not as
good as K-means, which is best, as expected. Loosely, K-m, SPECC, and the two SHC techniques seem
to be broadly equivalent. Note that, usually, the SAI values are higher for poorer performing
clustering techniques. This can be taken as an assessment of the quality of the clustering.


Table 1: The comparison of the proposed methods on 3-NORMALS data

method    K-m      CT       SPECC    EAC      SHCm     SHC20
MAI       0.97     0.87     0.94     0.84     0.92     0.93
SAI       0.000    0.031    0.114    0.158    0.126    0.118


4.2 Non-convex examples 

Here we compare all seven clustering techniques for the AGGREGATION, SPIRAL, HALF-RING, and 
FLAME data sets. 

4.2.1 AGGREGATION data 

The AGGREGATION data, depicted in the top panel of Figure 3, is used in [13] to show the performance
of ensembling.

If Kt = 7 is used, the clusterings from the best three methods (CHAMELEON, EAC, and SHC20)
are shown in the lower panels of Fig. 3. It is seen that EAC is a little worse than CHAMELEON
and SHC20 because it merges clusters to form cluster 2 and divides a cluster to give clusters 5 and
6. CHAMELEON, being graph theoretic, is better at separating the two clusters while SHC20 includes
randomness and therefore separates the two clusters slightly differently over different runs. EAC also
includes randomness so the panels in Fig. 3 merely show one run. The result of SHCm is also shown and
is noticeably worse: SHCm merges two clusters inappropriately and splits another cluster inappropriately
into three clusters. This is the only case among the examples here in which SHCm and SHC20 give
meaningfully different results, apparently because of the extreme points in the upper left cluster; note
that hypothesis 4 of Theorem 1 is violated.

These general appearances are consistent with the results in Table 2. Note that CT and CHAMELEON
have SAI zero because they are one-pass methods. The overall inference is that SHC20 and CHAMELEON
are essentially equivalent in this example.

Table 2: The comparison of the proposed methods on AGGREGATION data

method    K-m      CT       SPECC    CHA      EAC      SHCm     SHC20
MAI       0.83     0.81     0.92     1        0.95     0.84     0.98
SAI       0.000    0        0.044    0        0.000    0.000    0.044


Figure 3: Top panel: Original AGGREGATION data. Second panel: Clustering under CHAMELEON.
Third panel: Clustering under EAC. Fourth and fifth panels: Clustering under SHCm and SHC20,
respectively.


4.2.2 SPIRAL data


Figure 4 shows the SPIRAL data, which are often considered as a test case for nonconvex clustering.
Clearly, Kt = 3 and the clusters are the three lines of points. In this run, only EAC, SHCm and SHC20
find the correct clusters. More generally, if one uses many runs, the MAI and SAI present a slightly
different result: The best methods are SPECC and the two SHC’s. Of these, SHC20 and SHCm should
be preferred because SPECC has a higher SAI, as can be seen in Table 3.



Figure 4: Clustering of the SPIRAL data with six different methods as indicated on the panels; SHCm
and SHC20 are identical so only SHC20 is shown.


Table 3: The comparison of the proposed methods on SPIRAL data

method    K-m      CT       CHA      SPECC    EAC      SHCm     SHC20
MAI       0.34     0.35     0.44     0.94     0.90     1        1
SAI       0.000    0        0        0.148    0.134    0.000    0.000


4.2.3 HALF-RING data 

Figure 5 shows six clusterings of the HALF-RING data, considered in [14] (SHCm and SHC20 are nearly
identical). Intuitively, there are two clusters but the density of the points makes it ambiguous whether
the top half-ring should be split into two clusters or not. It can be seen that K-m and CT give poor
performance (they merge the left half of the bottom half-ring with the top half-ring) but the other
methods find the two clusters; this is seen in the results in Table 4. Note that when SHCm and SHC20
are essentially the same, we only show one of them.

While SHCm and SHC20 have slightly lower MAI’s than the other three good methods (CHAMELEON,
SPECC and EAC), they also have nonzero SAI’s, indicating the ambiguity in the top half-ring. Indeed,
when we ran SHC20 1000 times on the HALF-RING data, the two half-rings were in separate clusters
873 times while the right hand portion of the top half-ring was put in the same cluster as the bottom
ring 127 times. The ambiguity of two versus three clusters being appropriate is recognized by the
SHC’s, whereas the SAI’s of the other three good methods (CHAMELEON, SPECC and EAC) being zero
indicates they are not recognizing the ambiguity.

Figure 5: Clustering of the HALF-RING data with six different methods as indicated on the panels;
SHCm is the same as SHC20.


Table 4: The comparison of the proposed methods on HALF-RING data

method    K-m      CT       CHA      SPECC    EAC      SHCm     SHC20
MAI       0.78     0.78     1        1        1        0.99     0.97
SAI       0        0        0        0        0        0.044    0.063


4.2.4 FLAME data 


Fu and Medico developed a fuzzy clustering technique for DNA microarray data, which they considered
on the test data given in Figure 6. The website [9] refers to this as the FLAME data set. On this data
set, SHCm and SHC20 are seen to be the methods that best identify the two clusters. None of the
other five methods does as well; CHAMELEON, EAC, and SPECC fail completely because they are greatly
distorted by the two apparent outliers. By contrast, SHCm and SHC20 treat outliers carefully so as
to reduce their effect. K-m and CT do passably well, but put too many points in the upper cluster.

The overall performance of the methods is summarized in Table 5, which shows high MAI and zero
SAI for the two SHC methods, consistent with Fig. 6.




Figure 6: Clustering of FLAME data with six different methods as indicated on the panels (SHCm and
SHC20 are identical).


Table 5: The comparison of the proposed methods on FLAME data

method    K-m      CT       CHA      SPECC    EAC      SHCm     SHC20
MAI       0.84     0.84     0.71     0.7      0.65     0.89     0.88
SAI       0.031    0        0        0.122    0.0000   0.000    0.000


4.3 Higher dimensional data 

In this context it is difficult to generate two or three dimensional plots to evaluate how successful a
clustering strategy is visually. One can use projections onto planes; however, doing this systematically
increases in difficulty as the dimension increases. Hence, we only present tables of MAI and SAI values.
As a generality, CHA can only be expected to work well when the clusters are compact and can be
separated; this is less and less likely as dimension increases and the cases when it does occur tend to
be those where K-means performs well and is easier to use.

4.3.1 IRIS data 

This benchmark data set contains n = 150 observations with four attributes of Iris flowers. Table 6
gives the MAI’s and SAI’s for the replications of the methods on the IRIS data. K-m, CT
and the two SHC methods provide more accurate clusterings than the other methods for these data.
This is the only one of our examples where CT works well.


Table 6: The comparison of the proposed method on the IRIS data

method    K-m      CT       CHA      SPECC    EAC      SHCm     SHC20
MAI       0.89     0.88     0.73     0.69     0.69     0.89     0.88
SAI       0.000                      0.044    0.000    0.054    0.063


4.3.2 GARBER data 

To study the performance of the SHC’s with high dimensional data, we used the GARBER microarray data
introduced above. The data are the 916-dimensional gene expression profiles for lung tissue from n = 72 subjects.
Of these, five subjects were normal and 67 had lung tumors. The classification of the tumors into 
6 classes (plus normal) was done by a pathologist giving seven classes total. Accordingly, we expect 
seven clusters. The data set was constructed in m by filling in missing values estimated by the means 
within the same gene profiles. Table 7 presents the results.


Table 7: The comparison of the proposed method on the GARBER data

method    K-m      CT       CHA      SPECC    EAC      SHCm     SHC20
MAI       0.70     0.63     0.54     0.71     0.80     0.82     0.82
SAI       0.109                      0.054    0.000    0.000    0.000


4.3.3 WINE QUALITY data


The WINE QUALITY data is used in Cortez et al. [16] to study the classification of wines. For red
wine, n = 1599 and six clusters were found. For white wine, n = 4898 and seven clusters were found.
Both data sets (for red and white wine) have 11 attributes. Since the data sets are large, we drew
n = 300 observations randomly from each of them. Table 8 gives the MAI’s and SAI’s we found. In
both cases, the best methods were the two SHC’s and EAC with the SHC’s being slightly better. The
other two methods (K-m and SPECC) were worse and even EAC and the SHC’s could not be said to
perform well except in a relative sense.

Note that we omitted CHA from this example since it was difficult to use and, as was seen, CHA did
poorly on the first two of these higher dimensional examples. K-m was included. We also omitted
CT in this example since it never did well (except on the IRIS data, where three other methods did as
well).


Table 8: The comparison of the proposed method on the WINE QUALITY data

red wine

method    K-m      SPECC    EAC      SHCm     SHC20
MAI       0.29     0.4      0.48     0.47     0.48
SAI       0.031    0.070    0.028    0.031    0.031

white wine

method    K-m      SPECC    EAC      SHCm     SHC20
MAI       0.31     0.38     0.44     0.46     0.46
SAI       0.031    0.0547   0.002    0.044    0.031


4.3.4 Summary of the examples 

From these examples, it is seen that CT rarely performs well (only on IRIS) and so can be neglected. 
CHA works well only on two examples, HALF-RING and AGGREGATION, and its performance is never 
meaningfully better than both SHC’s. Likewise, SPECC and EAC are never meaningfully better than 
both SHC’s. Finally, K-m only performs really well on 3-NORMALS, a setting for which it was designed 
(and even there has close competitors like SHC20 and SPECC), and IRIS (although only trivially). The 
inference from this is that, as a generality, one of SHCm and SHC20 is the best method. The only 
meaningful difference in our examples occurred for AGGREGATION where SHC20 outperformed SHCm. 
Thus, although our theoretical results support SHCm, we are led to SHC20 as a default. 

4.4 Estimating cluster size 

Estimating the number of clusters is challenging because it requires knowing the structure of the
underlying population, something that is often not known. That is, identifying the number of groups in
a data set is a problem that is both physically and mathematically challenging. Nevertheless, there are
several methods to estimate the number of clusters, Kt. For instance, [18] uses the gap statistic (gap)
and it is implemented in the cluster package in R. Another popular method is the silhouette distance
(sil), implemented in the fpc package in R. In addition, one can estimate Kt using the Bayesian
information criterion (BIC), initializing by a hierarchical clustering as implemented, for instance, in
mclust in R.

Table 9 shows the estimates of the number of clusters and the true number of clusters using six
different methods for the data sets in the previous subsections. The numbers in parentheses are the
standard deviations (SD’s) for the estimates. The bottom row is the absolute error (AE) formed by
taking the sum of the absolute differences between the true and the estimated number of clusters. A
simple glance at the results shows that if the raw numbers are rounded to their nearest integers then
even though the EK methods based on SHCm and SHC20 do not always identify the correct number of
clusters, all other methods perform worse (for the data sets considered). Indeed, the gap statistic, sil, and
BIC do noticeably worse. Only EAC is comparable, and it is slightly worse than EKm which is slightly
worse than EK20, again validating our recommendation of using the 20th percentile, i.e., EK20, as a
good default. Unfortunately, the SD’s do not seem to provide a helpful guide as to which methods
are good; poor methods can have small SD’s and better methods can have larger SD’s, unlike for the
clustering methods.
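
The AE row can be reproduced directly from the definition. The short sketch below does so for the EKm and EK20 columns, using the mean estimates and true numbers of clusters reported in Table 9; the dictionary layout and the function name `absolute_error` are illustrative choices.

```python
# True numbers of clusters and mean estimates, copied from Table 9.
actual = {"FLAME": 2, "SPIRAL": 3, "HALF-RING": 2, "AGGREGATION": 7, "3-NORMALS": 3,
          "IRIS": 3, "Red wine": 6, "White wine": 7, "GARBER": 6}
ekm = {"FLAME": 2.2, "SPIRAL": 3.0, "HALF-RING": 2.1, "AGGREGATION": 5.0, "3-NORMALS": 3.4,
       "IRIS": 3.0, "Red wine": 3.2, "White wine": 3.5, "GARBER": 4.2}
ek20 = {"FLAME": 2.1, "SPIRAL": 3.0, "HALF-RING": 2.0, "AGGREGATION": 5.0, "3-NORMALS": 3.4,
        "IRIS": 2.9, "Red wine": 3.2, "White wine": 3.8, "GARBER": 4.6}

def absolute_error(actual, estimates):
    """Sum over data sets of |estimated - true| number of clusters."""
    return sum(abs(estimates[name] - actual[name]) for name in actual)

print(round(absolute_error(actual, ekm), 1))
print(round(absolute_error(actual, ek20), 1))
```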


Table 9: Estimated numbers of clusters and their SD’s using six techniques.

data set       actual   EKm        EK20       EAC        Gap        Sil        BIC
FLAME          2        2.2(0.3)   2.1(0.2)   2.1(0.3)   2.7(0.8)   4(0.0)     4(0.0)
SPIRAL         3        3(0.0)     3(0.0)     2.7(1.0)   7.9(1.4)   2(0)       6(0.0)
HALF-RING      2        2.1(0.3)   2.0(0.1)   2.0(0.4)   1(0)       20(0)      20(0.0)
AGGREGATION    7        5(0)       5(0)       5(0)       2.3(1)     4(0)       10(0)
3-NORMALS      3        3.4(1.7)   3.4(1.6)   3.3(2.3)   2.4(0.9)   3.1(0.3)   3.1(0.4)
IRIS           3        3.0(0.0)   2.9(0.2)   2.0(0.3)   3.4(1.3)   2(0)       2(0)
Red wine       6        3.2(1.6)   3.2(1.7)   3.6(3.8)   8.5(2.8)   2(0)       12.0(3.0)
White wine     7        3.5(2)     3.8(2.1)   3.8(2.8)   10.9(4.4)  2.4(1.3)   12.4(4.0)
GARBER         6        4.2(1.5)   4.6(2.2)   12(10.9)   5.7(2.5)   2(0)       5(0)
AE                      10.8       10         15.3       19         35.7       39.5


5 Conclusion 

Our approach leads to two natural techniques that differ in the dissimilarity used in the single linkage
step of our clustering approach. One dissimilarity is the usual Euclidean distance between two points.
The other is the 20-th percentile of the distances between points in two different clusters. We can
establish formal results for the Euclidean distance and it has a natural geometric interpretation. However,
the 20-th percentile dissimilarity gives performance that is no worse and sometimes meaningfully
better than the Euclidean distance.

In order to evaluate the performance of the proposed methods, we tested them on a wide variety of
qualitatively different clusterings, including both real and simulated data, convex and nonconvex true 
clusterings, and clusterings in which the components are not well separated. Our theory and examples 
suggest that our methods lead to accurate clusterings and that using B ~ 250 to form the membership 
matrices is enough to provide satisfactory stability. Of course, for more complicated data - irregular 
shapes, little separation between shapes, etc. - more ensembling (higher B) may lead to better results. 

In our examples, our methods effectively equal or outperform many standard or related methods
such as spectral clustering, K-means, EAC [3], hybrid hierarchical clustering [5], and CHAMELEON [6].
In fact, in all eight examples we presented here, one of the two forms of our approach always yielded
robust and relatively accurate results.